!pip install ipython-autotime
%load_ext autotime
Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 1.66 ms (started: 2021-02-11 06:59:29 +00:00)
# !pip install scikit-learn==0.24.1  # installed for permutation feature importance
time: 1.46 ms (started: 2021-02-11 06:59:29 +00:00)
!pip install yellowbrick==1.3
Collecting yellowbrick==1.3
Downloading https://files.pythonhosted.org/packages/d4/64/5e1cf10fb2ace980b71c992d1b84c807d8e69e9eddb389b35825b640ea48/yellowbrick-1.3-py3-none-any.whl (271kB)
Installing collected packages: yellowbrick
Found existing installation: yellowbrick 0.9.1
Uninstalling yellowbrick-0.9.1:
Successfully uninstalled yellowbrick-0.9.1
Successfully installed yellowbrick-1.3
time: 3.44 s (started: 2021-02-11 06:59:29 +00:00)
from yellowbrick.model_selection import FeatureImportances
time: 1.58 ms (started: 2021-02-11 09:08:04 +00:00)
# Plan / notes:
# - feature selection: PCA
# - hyperparameter tuning
# - plot feature importances
# - classic imbalanced-data applications: medical diagnosis, spam filtering,
#   and fraud detection (e.g. the credit card fraud detection dataset)
# - most machine learning algorithms work best when the number of samples in
#   each class is about equal
# - lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
#   liblinear is a coordinate-descent solver (from the LIBLINEAR library),
#   well suited to small datasets and L1 penalties
# - baseline: DummyClassifier with strategy='most_frequent'
# - evaluate with the confusion matrix, F1, precision, and recall:
#   low precision indicates a high number of false positives;
#   low recall indicates a high number of false negatives
# - decision trees frequently perform well on imbalanced data
# - random forest beats logistic regression on F1 and recall here;
#   maybe combine PCA with random forest
# - over-sample only AFTER the train/test split; resampling before the split
#   can cause overfitting
# - random over-sampling raised recall but lowered F1; with SMOTE, F1 rose
#   while recall stayed similar, compared with the baseline logistic
#   regression and random forest above
# - next, see whether under-sampling performs better (it also helps when a
#   dataset has millions of records)
# - after PCA, re-check F1, precision, and recall
# - idea: combine over- and under-sampling
# - important: generate new samples only in the training set, so the model
#   generalizes well to unseen data (i.e. apply SMOTE to the training split only)
time: 3.92 ms (started: 2021-02-11 09:08:04 +00:00)
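One note above deserves emphasis: resample the training split only. A minimal sketch of random over-sampling in NumPy, applied after the split (the helper name `oversample_minority` and the toy arrays are illustrative, not from the notebook):

```python
import numpy as np

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until both classes are balanced.

    Apply this to the training split only, never to the test split."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_needed = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Toy training split: 6 majority samples, 2 minority samples
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = oversample_minority(X_toy, y_toy)
```

Because only duplicated indices are added, the test split stays untouched and evaluation remains honest.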
!pip install solarsystem
import numpy as np
import pandas as pd
import solarsystem
from sklearn.linear_model import LogisticRegressionCV
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
import math
from sklearn.metrics import accuracy_score, classification_report, SCORERS
%matplotlib inline
path = '/content/Iran1900-2020.xlsx'
df_sheet0 = pd.read_excel(path, sheet_name=0)
df_sheet1 = pd.read_excel(path, sheet_name=1)
Requirement already satisfied: solarsystem in /usr/local/lib/python3.6/dist-packages (0.1.5)
time: 15.4 s (started: 2021-02-11 09:08:05 +00:00)
df_sheet0.head()
| | Date | Time | Lat. | Long. | Depth | Mag. | City | Province |
|---|---|---|---|---|---|---|---|---|
| 0 | 2006/01/01 | 03:08:08.7 | 33.962 | 48.661 | 6.0 | 3.3 | Eshtarinan | Lorestan |
| 1 | 2006/01/01 | 07:38:03.7 | 33.936 | 48.699 | 4.0 | 3.4 | Borojerd | Lorestan |
| 2 | 2006/01/01 | 12:36:48.5 | 38.033 | 48.336 | 6.0 | 2.6 | Hir | Ardebil |
| 3 | 2006/01/01 | 13:35:15.8 | 38.239 | 57.172 | 16.3 | 2.6 | Bäherden | Turkmenistan |
| 4 | 2006/01/01 | 13:54:29.2 | 37.943 | 48.277 | 10.0 | 2.7 | Hir | Ardebil |
time: 29.9 ms (started: 2021-02-11 09:08:21 +00:00)
df_sheet1.head()
| | Year | Month | Day | Hour | Lat | Long | Mag |
|---|---|---|---|---|---|---|---|
| 0 | 1900 | 6.0 | NaN | NaN | 38.50 | 43.30 | 5.0 |
| 1 | 1900 | 7.0 | 12.0 | NaN | 40.28 | 43.10 | 5.9 |
| 2 | 1902 | 2.0 | 13.0 | 93906.0 | 40.72 | 48.71 | 6.0 |
| 3 | 1902 | 2.0 | 21.0 | NaN | 41.80 | 48.80 | 5.6 |
| 4 | 1902 | 7.0 | 9.0 | 338.0 | 27.08 | 56.34 | 6.4 |
time: 28.5 ms (started: 2021-02-11 09:08:24 +00:00)
# Sheet 0 stores Date/Time as fixed-format strings; slice out the parts we need
date = df_sheet0['Date']
year = []
month = []
day = []
for i in date:
    year.append(i[0:4])
    month.append(i[5:7])
    day.append(i[8:10])

hour = []
for t in df_sheet0['Time']:
    hour.append(t[0:2])

lat = df_sheet0['Lat.']
long = df_sheet0['Long.']
mag = df_sheet0['Mag.']
df_sh2 = {'Year': year, 'Month': month, 'Day': day,
          'Hour': hour, 'Lat': lat, 'Long': long, 'Mag': mag}
df_sheet0 = pd.DataFrame(df_sh2)

# Stack the two sheets into one frame with a fresh index
Df = [df_sheet1, df_sheet0]
df = pd.concat(Df, ignore_index=True, sort=False)
time: 116 ms (started: 2021-02-11 09:08:25 +00:00)
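The string slicing above works because the Date column is fixed-format text; pandas can do the same extraction vectorized. A sketch on a hypothetical two-row frame mirroring sheet 0's columns:

```python
import pandas as pd

# Illustrative stand-in for df_sheet0's string columns
demo = pd.DataFrame({'Date': ['2006/01/01', '2020/10/13'],
                     'Time': ['03:08:08.7', '13:54:29.2']})

# Parse the whole Date column at once, then pull out the components
dt = pd.to_datetime(demo['Date'], format='%Y/%m/%d')
parsed = pd.DataFrame({'Year': dt.dt.year,
                       'Month': dt.dt.month,
                       'Day': dt.dt.day,
                       'Hour': demo['Time'].str[:2].astype(int)})
```

This also yields integer columns directly, instead of the string values the slicing loop produces.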
print(df)
# Now all the data are concatenated. We fill missing Hour values with a
# default of 12, and drop samples whose Day or Month is missing.
       Year Month Day   Hour     Lat    Long  Mag
0      1900     6  NaN    NaN  38.500  43.300  5.0
1      1900     7   12    NaN  40.280  43.100  5.9
2      1902     2   13  93906  40.720  48.710  6.0
3      1902     2   21    NaN  41.800  48.800  5.6
4      1902     7    9    338  27.080  56.340  6.4
...     ...   ...  ...    ...     ...     ...  ...
54088  2020    10   12     12  37.983  56.022  2.6
54089  2020    10   12     16  31.784  47.728  2.8
54090  2020    10   12     18  38.843  43.643  2.8
54091  2020    10   13     11  32.158  56.278  2.5
54092  2020    10   13     13  33.929  48.570  3.1

[54093 rows x 7 columns]
time: 23.6 ms (started: 2021-02-11 09:08:27 +00:00)
# delete
# del df["Mag"]
time: 2.09 ms (started: 2021-02-11 09:08:28 +00:00)
df.columns
Index(['Year', 'Month', 'Day', 'Hour', 'Lat', 'Long', 'Mag'], dtype='object')
time: 8.66 ms (started: 2021-02-11 09:08:29 +00:00)
df.isnull().any()
Year     False
Month     True
Day       True
Hour      True
Lat      False
Long     False
Mag      False
dtype: bool
time: 26.9 ms (started: 2021-02-11 09:08:30 +00:00)
len(df)
54093
time: 3.7 ms (started: 2021-02-11 09:08:30 +00:00)
df['Hour'].fillna(12, inplace=True)  # default missing hours to noon
df = df.dropna()
df = df.reset_index(drop=True)
time: 59.3 ms (started: 2021-02-11 09:08:31 +00:00)
len(df)  # after dropping rows with missing Day or Month
54088
time: 5.68 ms (started: 2021-02-11 09:08:32 +00:00)
# Drop rows whose Month or Day holds an invalid number
for x in df.index:
    if (int(df.loc[x, "Month"]) > 12) or (int(df.loc[x, "Month"]) < 1):
        df.drop(x, inplace=True)
    # elif: the Day check must not run on a row that was just dropped
    elif (int(df.loc[x, "Day"]) > 31) or (int(df.loc[x, "Day"]) < 1):
        df.drop(x, inplace=True)
df = df.reset_index(drop=True)
time: 1.85 s (started: 2021-02-11 09:08:32 +00:00)
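The row-by-row drop loop above can also be expressed as a single boolean mask, which sidesteps the drop-then-lookup pitfall entirely; a sketch on illustrative toy values:

```python
import pandas as pd

# Toy frame: only the first row has a valid month AND day
demo = pd.DataFrame({'Month': ['01', '13', '06', '00'],
                     'Day':   ['15', '10', '40', '05']})

month = demo['Month'].astype(int)
day = demo['Day'].astype(int)
valid = month.between(1, 12) & day.between(1, 31)
demo = demo[valid].reset_index(drop=True)
```

Filtering with one mask also avoids mutating the frame while iterating over its index.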
len(df)
54088
time: 4.09 ms (started: 2021-02-11 09:08:34 +00:00)
# Clean up malformed hour values: keep plausible hours (< 25);
# otherwise fall back to the leading digit(s)
for x in df.index:
    h = str(int(df.loc[x, "Hour"]))
    if len(h) == 1:
        continue
    elif (len(h) == 2) and int(h) < 25:
        continue
    elif (len(h) == 2) and int(h) > 24:
        df.loc[x, "Hour"] = h[0]
    elif int(h[0:2]) < 25:
        df.loc[x, "Hour"] = int(h[0:2])
    elif int(h[0:2]) > 24:
        df.loc[x, "Hour"] = int(h[0])
print(df)
       Year Month Day Hour     Lat    Long  Mag
0      1900     7  12   12  40.280  43.100  5.9
1      1902     2  13    9  40.720  48.710  6.0
2      1902     2  21   12  41.800  48.800  5.6
3      1902     7   9    3  27.080  56.340  6.4
4      1902     9   5    4  39.500  48.000  4.8
...     ...   ...  ..  ...     ...     ...  ...
54083  2020    10  12   12  37.983  56.022  2.6
54084  2020    10  12   16  31.784  47.728  2.8
54085  2020    10  12   18  38.843  43.643  2.8
54086  2020    10  13   11  32.158  56.278  2.5
54087  2020    10  13   13  33.929  48.570  3.1

[54088 rows x 7 columns]
time: 13.7 s (started: 2021-02-11 09:08:34 +00:00)
len(df)
54088
time: 4.71 ms (started: 2021-02-11 09:08:48 +00:00)
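The hour-repair rules above can be collected into one pure function, which makes them easy to verify against rows we have already seen (93906 became 9, and 338 became 3). The function name is illustrative:

```python
def fix_hour(raw):
    """Keep plausible hour values (< 25); otherwise fall back to the
    leading one or two digits, mirroring the cleanup loop above."""
    h = str(int(raw))
    if len(h) <= 2:
        return int(h) if int(h) < 25 else int(h[0])
    return int(h[:2]) if int(h[:2]) < 25 else int(h[0])
```

A pure function like this can be unit-tested once and then applied with `df['Hour'].map(fix_hour)`.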
df.columns
Index(['Year', 'Month', 'Day', 'Hour', 'Lat', 'Long', 'Mag'], dtype='object')
time: 4.89 ms (started: 2021-02-11 09:08:48 +00:00)
def norm(year, month, day, hour, planet):
    # Euclidean norm of a planet's (longitude, latitude) position pair
    E = solarsystem.geocentric.Geocentric(year=int(year), month=int(month),
                                          day=int(day), hour=int(hour), minute=0)
    e = E.position()
    p = e[planet]
    Np = np.sqrt(np.power(p[0], 2) + np.power(p[1], 2))
    return Np

A = solarsystem.geocentric.Geocentric(year=2020, month=1, day=1, hour=12, minute=0)
planet_names = A.objectnames()

def LONG_LAT(df, planet_names):
    # Add each object's geocentric longitude, latitude, and distance
    # to Earth as features
    for planet in planet_names:
        longit = []
        latit = []
        dis = []
        for x in df.index:
            E = solarsystem.geocentric.Geocentric(year=int(df.loc[x, "Year"]),
                                                  month=int(df.loc[x, "Month"]),
                                                  day=int(df.loc[x, 'Day']),
                                                  hour=int(df.loc[x, "Hour"]),
                                                  minute=0, UT=1, dst=1)
            e = E.position()
            p = e[planet]
            longit.append(p[0])
            latit.append(p[1])
            dis.append(p[2])
        df['longit' + planet] = longit
        df['lait' + planet] = latit  # was longit: latitude column accidentally duplicated longitude
        df['dic' + planet] = dis

LONG_LAT(df, planet_names)

def MOON(df):
    # Same three features for the Moon
    longit_m = []
    latit_m = []
    dis_m = []
    for x in df.index:
        M = solarsystem.moon.Moon(year=int(df.loc[x, "Year"]), month=int(df.loc[x, "Month"]),
                                  day=int(df.loc[x, 'Day']), hour=int(df.loc[x, "Hour"]),
                                  minute=0, UT=1, dst=1)
        m = M.position()
        longit_m.append(m[0])
        latit_m.append(m[1])
        dis_m.append(m[2])
    df['longitM'] = longit_m
    df['latitM'] = latit_m
    df['disM'] = dis_m

MOON(df)
print("dataset and feature\n\n")
print(df)
dataset and feature
Year Month Day Hour ... dicEris longitM latitM disM
0 1900 7 12 12 ... 91.950144 287.688144 3.109302 58.267792
1 1902 2 13 9 ... 93.254009 23.390379 1.236983 58.428715
2 1902 2 21 12 ... 93.311270 137.643550 -4.928150 59.182909
3 1902 7 9 3 ... 92.283644 153.617498 -4.224502 58.655810
4 1902 9 5 4 ... 91.785900 197.076035 -0.799823 60.847155
... ... ... .. ... ... ... ... ... ...
54083 2020 10 12 12 ... 94.962338 138.912518 4.234640 58.679271
54084 2020 10 12 16 ... 94.962146 141.230176 4.345465 58.518116
54085 2020 10 12 18 ... 94.962052 142.394206 4.398489 58.438187
54086 2020 10 13 11 ... 94.961342 152.426628 4.778735 57.783276
54087 2020 10 13 13 ... 94.961268 153.622832 4.814540 57.709863
[54088 rows x 46 columns]
time: 1min 53s (started: 2021-02-11 09:08:48 +00:00)
def LABEL(df):
    # Binary target: magnitude >= 4.5 counts as a significant earthquake
    label = []
    for x in df.index:
        if df.loc[x, "Mag"] >= 4.5:
            label.append(1)
        else:
            label.append(0)
    df['label'] = label

LABEL(df)
print(df.head(10))
   Year Month Day Hour ...     longitM    latitM       disM  label
0  1900     7  12   12 ...  287.688144  3.109302  58.267792      1
1  1902     2  13    9 ...   23.390379  1.236983  58.428715      1
2  1902     2  21   12 ...  137.643550 -4.928150  59.182909      1
3  1902     7   9    3 ...  153.617498 -4.224502  58.655810      1
4  1902     9   5    4 ...  197.076035 -0.799823  60.847155      1
5  1902    10   3   23 ...  215.156529  0.887864  61.851637      1
6  1902    10   4   14 ...  222.900628  1.586516  62.212634      1
7  1902    10  17    7 ...   22.312504  0.305154  58.124458      1
8  1902    10  26   11 ...  152.874054 -4.173225  59.549699      1
9  1902    12   2    4 ...  270.711501  4.722300  63.648217      1

[10 rows x 47 columns]
time: 574 ms (started: 2021-02-11 09:10:42 +00:00)
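The LABEL loop above is equivalent to a single vectorized comparison; a sketch on a hypothetical magnitude column:

```python
import pandas as pd

# Illustrative stand-in for the Mag column
demo = pd.DataFrame({'Mag': [3.1, 4.5, 6.0, 2.5]})

# One comparison labels every row at once; >= 4.5 is "significant"
demo['label'] = (demo['Mag'] >= 4.5).astype(int)
```

On 54,088 rows the vectorized form is also dramatically faster than per-row `df.loc` lookups.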
# Drop Mag so the magnitude itself cannot leak into the features
del df["Mag"]
time: 4.75 ms (started: 2021-02-11 09:10:42 +00:00)
df.columns
Index(['Year', 'Month', 'Day', 'Hour', 'Lat', 'Long', 'longitSun', 'laitSun',
'dicSun', 'longitMercury', 'laitMercury', 'dicMercury', 'longitVenus',
'laitVenus', 'dicVenus', 'longitMars', 'laitMars', 'dicMars',
'longitJupiter', 'laitJupiter', 'dicJupiter', 'longitSaturn',
'laitSaturn', 'dicSaturn', 'longitUranus', 'laitUranus', 'dicUranus',
'longitNeptune', 'laitNeptune', 'dicNeptune', 'longitPluto',
'laitPluto', 'dicPluto', 'longitCeres', 'laitCeres', 'dicCeres',
'longitChiron', 'laitChiron', 'dicChiron', 'longitEris', 'laitEris',
'dicEris', 'longitM', 'latitM', 'disM', 'label'],
dtype='object')
time: 4.08 ms (started: 2021-02-11 09:10:42 +00:00)
len(df.columns)
46
time: 3.35 ms (started: 2021-02-11 09:11:25 +00:00)
featur_name = []
for col in df.columns:
    if col in ('Year', 'Month', 'Day', 'Hour', 'label'):
        # ('Lat', 'Long', 'Mag' could be excluded here as well)
        pass
    else:
        print(col)
        featur_name.append(col)
Lat
Long
longitSun
laitSun
dicSun
longitMercury
laitMercury
dicMercury
longitVenus
laitVenus
dicVenus
longitMars
laitMars
dicMars
longitJupiter
laitJupiter
dicJupiter
longitSaturn
laitSaturn
dicSaturn
longitUranus
laitUranus
dicUranus
longitNeptune
laitNeptune
dicNeptune
longitPluto
laitPluto
dicPluto
longitCeres
laitCeres
dicCeres
longitChiron
laitChiron
dicChiron
longitEris
laitEris
dicEris
longitM
latitM
disM
time: 8.78 ms (started: 2021-02-11 09:11:26 +00:00)
print(featur_name)
print(df["label"])
['Lat', 'Long', 'longitSun', 'laitSun', 'dicSun', 'longitMercury', 'laitMercury', 'dicMercury', 'longitVenus', 'laitVenus', 'dicVenus', 'longitMars', 'laitMars', 'dicMars', 'longitJupiter', 'laitJupiter', 'dicJupiter', 'longitSaturn', 'laitSaturn', 'dicSaturn', 'longitUranus', 'laitUranus', 'dicUranus', 'longitNeptune', 'laitNeptune', 'dicNeptune', 'longitPluto', 'laitPluto', 'dicPluto', 'longitCeres', 'laitCeres', 'dicCeres', 'longitChiron', 'laitChiron', 'dicChiron', 'longitEris', 'laitEris', 'dicEris', 'longitM', 'latitM', 'disM']
0 1
1 1
2 1
3 1
4 1
..
54083 0
54084 0
54085 0
54086 0
54087 0
Name: label, Length: 54088, dtype: int64
time: 4.61 ms (started: 2021-02-11 09:11:27 +00:00)
df["label"].value_counts()  # class balance: significant quakes are the minority
0    49626
1     4462
Name: label, dtype: int64
time: 10.5 ms (started: 2021-02-11 09:11:28 +00:00)
X = df.loc[:, featur_name].to_numpy()
y = df.loc[:, ['label']].to_numpy()
scalar = StandardScaler()  # keep a reference so the same fitted scaler can be reused on new samples
X = scalar.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020)
time: 135 ms (started: 2021-02-11 09:11:29 +00:00)
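One caveat in the cell above: the scaler is fitted on all of X before the split, so test-set statistics leak into the scaling. A leak-free sketch in plain NumPy (all names and the random stand-in data are illustrative; with scikit-learn the equivalent is `scaler.fit(X_train)` followed by `transform` on both splits):

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first...
n_train = 80
X_tr, X_te = X_demo[:n_train], X_demo[n_train:]

# ...then compute scaling statistics from the training split only
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_s = (X_tr - mu) / sigma
X_te_s = (X_te - mu) / sigma   # test data scaled with TRAINING statistics
```

The test split is transformed with the training mean and standard deviation, never with its own.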
X.shape
(54088, 41)
time: 5.21 ms (started: 2021-02-11 09:11:30 +00:00)
y.shape
(54088, 1)
time: 6.87 ms (started: 2021-02-11 09:11:30 +00:00)
type(y)
numpy.ndarray
time: 4.75 ms (started: 2021-02-11 09:11:32 +00:00)
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
time: 1.28 ms (started: 2021-02-11 09:11:36 +00:00)
train_data = pd.DataFrame(np.column_stack((X,y)))
# Count classes and plot
target_count = train_data.iloc[:,-1].value_counts()
print('Class 0:', target_count[0])
print('Class 1:', target_count[1])
target_count.plot(kind='bar', title='Count (target)');
Class 0: 49626
Class 1: 4462
time: 169 ms (started: 2021-02-11 09:11:38 +00:00)
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from scipy.stats import itemfreq
# Remove column
labels = train_data.columns[:-1]
X = train_data[labels]
y = train_data.iloc[:,-1]
# Split the data into train:test @ ratio 80:20
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=1)
# Define the classifier model
model = SGDClassifier(max_iter=1000, tol=1e-3)
# Fit data to the model (train)
model.fit(X_train, y_train)
# Predict results of test data
y_pred_raw = model.predict(X_test)
# Get accuracy score
accuracy = accuracy_score(y_test, y_pred_raw)
print('Accuracy: %.2f%%' % (accuracy * 100.0))
# Show predicted classes
print(itemfreq(y_pred_raw))
Accuracy: 93.85%
[[0.000e+00 9.989e+03]
 [1.000e+00 8.290e+02]]
time: 718 ms (started: 2021-02-11 09:11:41 +00:00)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:25: DeprecationWarning: `itemfreq` is deprecated!
`itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
# y_pred_raw
# itemfreq returns a 2-column array of (value, count) pairs; it is deprecated
# in favor of np.unique(..., return_counts=True)
time: 1.66 ms (started: 2021-02-11 09:11:41 +00:00)
# Fit single feature data to the model (train)
model.fit(X_train.iloc[:,-1].values.reshape(-1, 1), y_train)
# Predict results of test data
y_pred_single = model.predict(
    X_test.iloc[:,-1].values.reshape(-1, 1))
# Get accuracy score
accuracy = accuracy_score(y_test, y_pred_single)
print('Accuracy: %.2f%%' % (accuracy * 100.0))
# Show predicted classes
print(itemfreq(y_pred_single))
Accuracy: 91.51%
[[    0. 10818.]]
time: 55.1 ms (started: 2021-02-11 09:11:42 +00:00)
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
# Build confusion matrix from ground truth labels and model predictions
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred_single)
print('Confusion matrix:\n', conf_mat)
# Plot matrix
plt.matshow(conf_mat)
plt.colorbar()
plt.ylabel('Real Class')
plt.xlabel('Predicted Class')
plt.show()
Confusion matrix:
 [[9900    0]
 [ 918    0]]
time: 409 ms (started: 2021-02-11 09:11:43 +00:00)
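Reading the matrix above (rows = actual class, columns = predicted class): the single-feature model never predicts class 1, so its recall for class 1 is zero even though accuracy looks decent. The arithmetic, using the values from the output:

```python
import numpy as np

# sklearn's layout: rows = true class, columns = predicted class
conf = np.array([[9900, 0],
                 [918,  0]])   # the single-feature model: everything predicted as 0

tp = conf[1, 1]   # true positives for class 1
fp = conf[0, 1]   # false positives
fn = conf[1, 0]   # false negatives

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
accuracy = np.trace(conf) / conf.sum()
```

Accuracy works out to 9900 / 10818 ≈ 91.51%, matching the printed score, while precision and recall for the minority class are both zero.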
# Redefine classifier model to use class_weight
model = SGDClassifier(max_iter=1000,
                      tol=1e-3,
                      class_weight='balanced')
# Train
model.fit(X_train, y_train)
# Predict
y_pred_wtd = model.predict(X_test)
# Get accuracy score
accuracy = accuracy_score(y_test, y_pred_wtd)
print('Accuracy: %.2f%%' % (accuracy * 100.0))
print(classification_report(y_test,y_pred_wtd))
print(itemfreq(y_pred_wtd))
# Build confusion matrix
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred_wtd)
print('Confusion matrix:\n', conf_mat)
# Plot matrix
plt.matshow(conf_mat)
plt.colorbar()
plt.ylabel('Real Class')
plt.xlabel('Predicted Class')
plt.show()
Accuracy: 91.31%
precision recall f1-score support
0.0 0.97 0.93 0.95 9900
1.0 0.49 0.73 0.59 918
accuracy 0.91 10818
macro avg 0.73 0.83 0.77 10818
weighted avg 0.93 0.91 0.92 10818
[[0.000e+00 9.450e+03]
[1.000e+00 1.368e+03]]
Confusion matrix:
[[9205 695]
[ 245 673]]
time: 1.35 s (started: 2021-02-11 09:11:45 +00:00)
# Class counts: value_counts() sorts descending, so the majority class (0) comes first
target_0_count, target_1_count = train_data.iloc[:,-1].value_counts()
# Separate classes
target_0 = train_data[train_data.iloc[:,-1] == 0]
target_1 = train_data[train_data.iloc[:,-1] == 1]
# Under-sample the majority class down to the minority count
target_0_undersample = target_0.sample(target_1_count)
# Merge back to single df
test_undersample = pd.concat([target_0_undersample, target_1],
                             axis=0)
# Show counts and plot
print('Random under-sampling:')
print(test_undersample.iloc[:,-1].value_counts())
test_undersample.iloc[:,-1].value_counts().plot(kind='bar', title='Count (target)');
Random under-sampling:
0.0    4462
1.0    4462
Name: 41, dtype: int64
time: 212 ms (started: 2021-02-11 09:11:47 +00:00)
# Over-sample the minority class up to the majority count (with replacement)
n_majority = (train_data.iloc[:,-1] == 0).sum()
target_1_oversample = target_1.sample(n_majority, replace=True)
# Merge back to single df
test_oversample = pd.concat([target_0, target_1_oversample], axis=0)
# Show counts and plot
print('Random over-sampling:')
print(test_oversample.iloc[:,-1].value_counts())
test_oversample.iloc[:,-1].value_counts().plot(kind='bar', title='Count (target)');
Random over-sampling:
0.0    49626
1.0    49626
Name: 41, dtype: int64
time: 192 ms (started: 2021-02-11 09:11:50 +00:00)
X.shape
(54088, 41)
time: 5.14 ms (started: 2021-02-11 09:11:52 +00:00)
type(X)
pandas.core.frame.DataFrame
time: 7.73 ms (started: 2021-02-11 09:11:54 +00:00)
y.shape
(54088,)
time: 3.17 ms (started: 2021-02-11 09:11:56 +00:00)
type(y)
pandas.core.series.Series
time: 9.92 ms (started: 2021-02-11 09:12:00 +00:00)
from sklearn.decomposition import PCA
# Define PCA model, specifying 2 dimensions
pca = PCA(n_components=2)
# Fit, keeping the fitted PCA so it can be reused on new samples later
pca = pca.fit(X)
X_2d = pca.transform(X)
# Plot helper function
def draw_plot(X, y, label):
    for l in np.unique(y):
        plt.scatter(
            X[y==l, 0],
            X[y==l, 1],
            label=l
        )
    plt.title(label)
    plt.legend()
    plt.show()
# plot raw PCA
draw_plot(X_2d, y, 'Raw Data')
time: 1.27 s (started: 2021-02-11 09:12:43 +00:00)
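Before trusting a 2-D projection of 41 features, it is worth checking how much variance the two components actually keep, via `explained_variance_ratio_`. A sketch on synthetic data that is genuinely two-dimensional (all names and the toy construction are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Build 5-D data that secretly lives on a 2-D subspace, plus tiny noise
base = rng.normal(size=(200, 2))
W = rng.normal(size=(2, 5))
X_demo = base @ W + 0.01 * rng.normal(size=(200, 5))

pca5 = PCA().fit(X_demo)
var_top2 = pca5.explained_variance_ratio_[:2].sum()
```

If `var_top2` were low on the real feature matrix, a 2-D PCA plot (and models trained on `X_2d`) would be discarding most of the signal.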
type(X_2d)
numpy.ndarray
time: 3.26 ms (started: 2021-02-11 09:12:47 +00:00)
from imblearn.over_sampling import SMOTE
# Define SMOTE model and specify minority class for oversample
smote = SMOTE(ratio='minority', k_neighbors=4)
# Fit data
X_smote, y_smote = smote.fit_sample(X_2d, y)
# Plot
draw_plot(X_smote, y_smote, 'SMOTE')
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function safe_indexing is deprecated; safe_indexing is deprecated in version 0.22 and will be removed in version 0.24. warnings.warn(msg, category=FutureWarning)
time: 1.23 s (started: 2021-02-11 09:12:50 +00:00)
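SMOTE's synthetic samples are not copies: each one is placed at a random position on the line segment between a minority point and one of its k nearest minority neighbors. The core interpolation step, sketched in NumPy (the function name is illustrative):

```python
import numpy as np

def smote_point(x, neighbor, rng):
    """SMOTE's core step: a synthetic sample somewhere on the segment
    between a minority point and one of its minority-class neighbors."""
    gap = rng.random()            # uniform in [0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(42)
a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])
synth = smote_point(a, b, rng)    # lies on the segment from a to b
```

ADASYN (next cell) uses the same interpolation but generates more synthetic points near minority samples that are hard to classify.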
from imblearn.over_sampling import ADASYN
# Define ADASYN model and specify the minority class for over-sampling
adasyn = ADASYN(ratio='minority', n_neighbors=4)
# Fit Data
X_adasyn, y_adasyn = adasyn.fit_sample(X_2d, y)
# Plot
draw_plot(X_adasyn, y_adasyn, 'ADASYN')
time: 1.35 s (started: 2021-02-11 09:12:54 +00:00)
from imblearn.under_sampling import TomekLinks
from collections import Counter
# Define model
tome = TomekLinks(return_indices=True, ratio='auto', random_state=42)
# Fit data
X_tome, y_tome, id_tome = tome.fit_sample(X_2d, y)
# Find removed indices
idx_samples_removed = np.setdiff1d(np.arange(X_2d.shape[0]),id_tome)
# Show result
print('Removed indexes:', idx_samples_removed)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_tome)))
draw_plot(X_tome, y_tome, 'Tomek links under-sampling')
Removed indexes: [ 40 47 83 ... 53625 53637 54045]
Original dataset shape Counter({0.0: 49626, 1.0: 4462})
Resampled dataset shape Counter({0.0: 48244, 1.0: 4462})
time: 812 ms (started: 2021-02-11 09:12:58 +00:00)
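A Tomek link is a pair of opposite-class points that are each other's nearest neighbors; removing the majority-class member of each pair (as the cell above does) cleans up the class boundary. A small NumPy check of the definition (all names are illustrative):

```python
import numpy as np

def is_tomek_link(X, y, i, j):
    """True when points i and j have different labels and are
    mutual nearest neighbors."""
    if y[i] == y[j]:
        return False
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                     # exclude the point itself
    if d.argmin() != j:
        return False
    d = np.linalg.norm(X - X[j], axis=1)
    d[j] = np.inf
    return d.argmin() == i

# 1-D toy data: the points at 1.0 and 1.1 straddle the class boundary
X_demo = np.array([[0.0], [1.0], [1.1], [5.0]])
y_demo = np.array([0, 0, 1, 1])
```

Only boundary-straddling pairs qualify, which is why Tomek-link removal deletes relatively few samples (1,382 out of 49,626 above).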
from imblearn.combine import SMOTETomek
# Define model using previous SMOTE model
smto = SMOTETomek(ratio='auto', smote=smote)
# Fit data
X_smto, y_smto = smto.fit_sample(X_2d, y)
# Plot
draw_plot(X_smto, y_smto, 'SMOTE + TOMEK')
time: 1.57 s (started: 2021-02-11 09:13:02 +00:00)
type(X_smto)
numpy.ndarray
time: 3.04 ms (started: 2021-02-11 09:13:06 +00:00)
X_smto.shape
(94428, 2)
time: 5.72 ms (started: 2021-02-11 09:13:07 +00:00)
type(y_smto)
numpy.ndarray
time: 7.35 ms (started: 2021-02-11 09:13:09 +00:00)
y_smto.shape
(94428,)
time: 4.03 ms (started: 2021-02-11 09:13:10 +00:00)
# Create train: test split for new data
X_train_st, X_test_st, y_train_st, y_test_st = train_test_split(X_smto, y_smto, test_size=0.2, random_state=1)
# Define model
model = SGDClassifier(max_iter=1000,
                      tol=1e-3,
                      class_weight='balanced')
# Fit data
model.fit(X_train_st, y_train_st)
# Predict
y_pred_st = model.predict(X_test_st)
# Get accuracy score
accuracy = accuracy_score(y_test_st, y_pred_st)
# Build confusion matrix
conf_mat = confusion_matrix(y_true=y_test_st, y_pred=y_pred_st)
# Output results
print('Accuracy: %.2f%%' % (accuracy * 100.0))
print(classification_report(y_test_st,y_pred_st))
print(itemfreq(y_pred_st))
print('Confusion matrix:\n', conf_mat)
plt.matshow(conf_mat)
plt.colorbar()
plt.ylabel('Real Class')
plt.xlabel('Predicted Class')
plt.show()
Accuracy: 77.08%
precision recall f1-score support
0.0 0.84 0.68 0.75 9491
1.0 0.73 0.87 0.79 9395
accuracy 0.77 18886
macro avg 0.78 0.77 0.77 18886
weighted avg 0.78 0.77 0.77 18886
[[0.0000e+00 7.6730e+03]
[1.0000e+00 1.1213e+04]]
Confusion matrix:
[[6418 3073]
[1255 8140]]
time: 563 ms (started: 2021-02-11 09:18:55 +00:00)
# Create train: test split for new data
X_train_st, X_test_st, y_train_st, y_test_st = train_test_split(X_adasyn, y_adasyn, test_size=0.2, random_state=1)
# Define model
model = SGDClassifier(max_iter=1000,
                      tol=1e-3,
                      class_weight='balanced')
# Fit data
model.fit(X_train_st, y_train_st)
# Predict
y_pred_st = model.predict(X_test_st)
# Get accuracy score
accuracy = accuracy_score(y_test_st, y_pred_st)
# Build confusion matrix
conf_mat = confusion_matrix(y_true=y_test_st, y_pred=y_pred_st)
# Output results
print('Accuracy: %.2f%%' % (accuracy * 100.0))
print(classification_report(y_test_st,y_pred_st))
print(itemfreq(y_pred_st))
print('Confusion matrix:\n', conf_mat)
plt.matshow(conf_mat)
plt.colorbar()
plt.ylabel('Real Class')
plt.xlabel('Predicted Class')
plt.show()
Accuracy: 70.44%
precision recall f1-score support
0.0 0.72 0.66 0.69 9976
1.0 0.69 0.75 0.72 9950
accuracy 0.70 19926
macro avg 0.71 0.70 0.70 19926
weighted avg 0.71 0.70 0.70 19926
[[0.0000e+00 9.1180e+03]
[1.0000e+00 1.0808e+04]]
Confusion matrix:
[[6602 3374]
[2516 7434]]
time: 577 ms (started: 2021-02-11 09:13:34 +00:00)
logReg_clf = LogisticRegression(max_iter=100000)
logReg_clf.fit(X_train_st, y_train_st)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
time: 123 ms (started: 2021-02-11 09:13:40 +00:00)
def evaluator(model, test_features, test_labels):
    predictions = model.predict(test_features)
    print('Model Performance', "\n")
    score = accuracy_score(test_labels, predictions)
    print('Accuracy: ', round(score, 5), "\n")
    confusion_matrix = pd.crosstab(test_labels, predictions,
                                   rownames=['Actual'], colnames=['Predicted'])
    print(confusion_matrix)
    plt.figure(figsize=(10, 6))
    sns.heatmap(confusion_matrix, annot=True)
    print('\nClassification Report:')
    print(classification_report(test_labels, predictions))
time: 7.26 ms (started: 2021-02-11 09:13:41 +00:00)
evaluator(logReg_clf, X_test_st, y_test_st)
Model Performance
Accuracy: 0.71013
Predicted 0.0 1.0
Actual
0.0 6790 3186
1.0 2590 7360
Classification Report:
precision recall f1-score support
0.0 0.72 0.68 0.70 9976
1.0 0.70 0.74 0.72 9950
accuracy 0.71 19926
macro avg 0.71 0.71 0.71 19926
weighted avg 0.71 0.71 0.71 19926
time: 380 ms (started: 2021-02-11 09:13:44 +00:00)
'''
year = 2006
month = 2
day = 28
hour = 7
lat = 28.117
lang = 56.759
'''
year = 2021
month = 2
day = 5
hour = 17
lat = 25.88
lang = 59.19
df_p = {'Year': [year], 'Month': [month], 'Day': [day],
        'Hour': [hour], 'Lat': [lat], 'Long': [lang]}
dfp = pd.DataFrame(df_p)
print(dfp, '\n\n\n')
LONG_LAT(dfp, planet_names)
MOON(dfp)
   Year  Month  Day  Hour    Lat   Long
0  2021      2    5    17  25.88  59.19
time: 52.4 ms (started: 2021-02-11 11:47:18 +00:00)
X_test = dfp.loc[:, featur_name].to_numpy()
# Reuse the scaler fitted on the training data; fitting a fresh scaler here
# would standardize with statistics from this single row
X_test = scalar.transform(X_test)
X_2d = pca.transform(X_test)
time: 9.7 ms (started: 2021-02-11 11:47:19 +00:00)
# Train the SGD classifier on the SMOTE+Tomek data
model.fit(X_smto, y_smto)
# Predict
predictions_2 = model.predict(X_2d)  # from SGD classifier with SMOTE+Tomek
# Train a plain logistic regression on the same resampled data
logReg_clf.fit(X_smto, y_smto)
predictions_1 = logReg_clf.predict(X_2d)  # from logistic classifier with SMOTE+Tomek
# Train on the full, un-resampled feature set
X = df.loc[:, featur_name].to_numpy()
y = df.loc[:, ['label']].to_numpy()
X = scalar.transform(X)
logReg_clf.fit(X, y.ravel())  # ravel() avoids the column-vector DataConversionWarning
predictions_3 = logReg_clf.predict(X_test)  # without resampling
print(predictions_3)
[0]
time: 1.43 s (started: 2021-02-11 11:47:21 +00:00)
#predictions_1 = logReg_clf.predict(X_2d) # from logistic classifier with smotetomek
#evaluator(logReg_clf, X_test, y_test)
print(predictions_1)
[0.]
time: 1.42 ms (started: 2021-02-11 11:47:23 +00:00)
#predictions_2 = model.predict(X_2d) # from sgd classifier with smotetomek
print(predictions_2)
[0.]
time: 2.28 ms (started: 2021-02-11 11:47:27 +00:00)
# predictions_3 = logReg_clf.predict(X_2d) # without smote
print(predictions_3)
[0]
time: 2.15 ms (started: 2021-02-11 11:47:29 +00:00)
if (predictions_1 + predictions_2 + predictions_3) >= 2:
    print("label is : ", 1)
else:
    print("label is : ", 0)
label is :  0
time: 7.97 ms (started: 2021-02-11 11:47:31 +00:00)
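Because each prediction is a one-element NumPy array, the sum-and-compare above is effectively an elementwise majority vote. A more explicit version of the same logic, with hypothetical outputs standing in for the three classifiers:

```python
import numpy as np

# Hypothetical one-element outputs, as produced by the three classifiers above.
predictions_1 = np.array([0.])
predictions_2 = np.array([0.])
predictions_3 = np.array([0])

# Count the votes for label 1; at least 2 of 3 wins.
votes = sum(int(p[0]) for p in (predictions_1, predictions_2, predictions_3))
label = 1 if votes >= 2 else 0
print("label is : ", label)
```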
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# visualisations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style("whitegrid")
sns.set(rc = {'figure.figsize':(15, 10)})
time: 6.42 ms (started: 2021-02-11 00:06:34 +00:00)
I define a few helper functions to make analysis more convenient and presentable.
# udfs ----
# function for creating a feature importance dataframe
def imp_df(column_names, importances):
    df = pd.DataFrame({'feature': column_names,
                       'feature_importance': importances}) \
           .sort_values('feature_importance', ascending = False) \
           .reset_index(drop = True)
    return df

# plotting a feature importance dataframe (horizontal barchart)
def var_imp_plot(imp_df, title):
    imp_df.columns = ['feature', 'feature_importance']
    sns.barplot(x = 'feature_importance', y = 'feature', data = imp_df, orient = 'h', color = 'royalblue') \
       .set_title(title, fontsize = 20)
time: 4.07 ms (started: 2021-02-11 00:06:36 +00:00)
print(type(y))
print(type(X))
print(y.shape)
print(X.shape)
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
(506,)
(506, 14)
time: 3.44 ms (started: 2021-02-10 23:16:10 +00:00)
X = df.loc[:, featur_name] # .to_numpy()
y = df.loc[:, ['label']].to_numpy()
# X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020)
time: 33.9 ms (started: 2021-02-11 00:07:12 +00:00)
X_train.shape
(43270, 41)
time: 4.82 ms (started: 2021-02-11 10:08:58 +00:00)
#type(StandardScaler().fit_transform(X))
time: 876 µs (started: 2021-02-11 00:07:14 +00:00)
y = y.squeeze()
time: 949 µs (started: 2021-02-11 00:07:16 +00:00)
print(type(y))
print(type(X))
print(y.shape)
print(X.shape)
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
(54088,)
(54088, 41)
time: 2.92 ms (started: 2021-02-11 00:07:18 +00:00)
sns.heatmap(X.assign(target = y).corr().round(2), cmap = 'Blues', annot = True).set_title('Correlation matrix', fontsize = 16)
Text(0.5, 1.0, 'Correlation matrix')
time: 7.09 s (started: 2021-02-11 00:07:27 +00:00)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 100,
                           n_jobs = -1,
                           oob_score = True,
                           bootstrap = True,
                           random_state = 42)
rf.fit(X_train, y_train.ravel())  # ravel() avoids the DataConversionWarning for column-vector y
RandomForestRegressor(n_jobs=-1, oob_score=True, random_state=42)
time: 4min 40s (started: 2021-02-11 00:07:42 +00:00)
# NOTE: no separate validation set is held out here -- "validation"
# below is just the training set again, so the two scores will match
X_valid, y_valid = X_train, y_train
time: 935 µs (started: 2021-02-11 00:12:31 +00:00)
print('R^2 Training Score: {:.2f} \nOOB Score: {:.2f} \nR^2 Validation Score: {:.2f}'.format(rf.score(X_train, y_train),
rf.oob_score_,
rf.score(X_valid, y_valid)))
R^2 Training Score: 0.91
OOB Score: 0.35
R^2 Validation Score: 0.91
time: 1.43 s (started: 2021-02-11 00:12:34 +00:00)
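The gap between the training R^2 (0.91) and the OOB score (0.35) is the interesting part: the OOB score rates each sample using only trees that did not see it during bootstrap sampling, so it is a far less optimistic estimate than scoring on the training set. A small sketch of that relation on synthetic data (all names below are stand-ins, not the notebook's objects):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Stand-in data and model, just to illustrate what oob_score_ measures.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=100, oob_score=True,
                                bootstrap=True, random_state=0).fit(X_demo, y_demo)

# oob_score_ is the R^2 of the out-of-bag predictions: each sample is
# predicted only by the trees whose bootstrap sample excluded it.
print(np.isclose(rf_demo.oob_score_, r2_score(y_demo, rf_demo.oob_prediction_)))  # True
```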
Let's start with decision trees to build some intuition. In a decision tree, every node is a condition on how to split values of a single feature, chosen so that similar values of the dependent variable end up in the same set after the split. The condition is based on impurity: for classification problems this is Gini impurity or information gain (entropy), while for regression trees it is variance. When training a tree we can therefore compute how much each feature contributes to decreasing the weighted impurity. feature_importances_ in Scikit-learn is based on that logic; in the case of a Random Forest, the decrease in impurity is averaged over all trees.
Pros:

- fast to compute, it comes for free with a fitted forest
- easy to retrieve with a single call to feature_importances_

Cons:

- biased towards high-cardinality (many unique values) features
- computed on the training set, so it can reward features that merely help overfit
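In Scikit-learn, a forest's feature_importances_ is the average of the (already normalised) per-tree importances, which can be checked directly. The sketch below uses synthetic data and stand-in names, not the notebook's model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the notebook's data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# The forest's importances are the per-tree mean decrease in impurity,
# averaged over all trees in the ensemble.
manual = np.mean([t.feature_importances_ for t in rf_demo.estimators_], axis=0)
print(np.allclose(manual, rf_demo.feature_importances_))  # True
```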
base_imp = imp_df(X_train.columns, rf.feature_importances_)
base_imp
| | feature | feature_importance |
|---|---|---|
| 0 | longitNeptune | 0.188950 |
| 1 | laitNeptune | 0.182760 |
| 2 | Lat | 0.071047 |
| 3 | Long | 0.067274 |
| 4 | longitPluto | 0.033988 |
| 5 | laitPluto | 0.026657 |
| 6 | latitM | 0.026003 |
| 7 | disM | 0.025455 |
| 8 | longitM | 0.023918 |
| 9 | dicMercury | 0.020264 |
| 10 | dicCeres | 0.018099 |
| 11 | dicVenus | 0.017553 |
| 12 | dicMars | 0.016492 |
| 13 | dicJupiter | 0.015655 |
| 14 | dicPluto | 0.014708 |
| 15 | dicSaturn | 0.014167 |
| 16 | dicEris | 0.013573 |
| 17 | dicSun | 0.013037 |
| 18 | dicNeptune | 0.012876 |
| 19 | dicUranus | 0.012470 |
| 20 | dicChiron | 0.011985 |
| 21 | laitCeres | 0.010971 |
| 22 | longitCeres | 0.010015 |
| 23 | longitChiron | 0.009997 |
| 24 | laitChiron | 0.009799 |
| 25 | laitMars | 0.009764 |
| 26 | longitVenus | 0.009739 |
| 27 | longitMars | 0.009446 |
| 28 | laitVenus | 0.009387 |
| 29 | laitMercury | 0.009351 |
| 30 | longitMercury | 0.009036 |
| 31 | longitJupiter | 0.008216 |
| 32 | laitJupiter | 0.008049 |
| 33 | laitEris | 0.007943 |
| 34 | longitEris | 0.007683 |
| 35 | longitSun | 0.007661 |
| 36 | longitSaturn | 0.007530 |
| 37 | laitSun | 0.007486 |
| 38 | laitSaturn | 0.007077 |
| 39 | longitUranus | 0.007001 |
| 40 | laitUranus | 0.006919 |
time: 124 ms (started: 2021-02-11 00:12:43 +00:00)
var_imp_plot(base_imp, 'Default feature importance (scikit-learn)')
time: 714 ms (started: 2021-02-11 00:12:45 +00:00)
It seems that the top 3 most important features are longitNeptune, laitNeptune and Lat, with Long close behind. The default importances then drop off quickly, and each of the remaining features contributes only a few percent. Let's see how the features are ranked by different approaches.
This approach directly measures feature importance by observing how randomly re-shuffling each predictor (which preserves the distribution of the variable) influences model performance.

The approach can be described in the following steps:

1. Train the model and record a baseline score.
2. Shuffle the values of a single feature, leaving all other columns untouched.
3. Re-score the model on the modified data; the feature's importance is the drop from the baseline.
4. Repeat for every feature, optionally averaging over several shuffles.

Pros:

- applicable to any fitted model, with no retraining required
- the shuffled column keeps its original distribution

Cons:

- more expensive to compute than the default feature_importances_
- correlated features can share, and thereby dilute, their measured importance
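A minimal sketch of that reshuffle-and-rescore procedure on synthetic data (all names below are stand-ins, not the notebook's model):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in model and data; the notebook's rf / X_train would play these roles.
X_demo, y_demo = make_regression(n_samples=300, n_features=4, random_state=1)
model_demo = RandomForestRegressor(n_estimators=25, random_state=1).fit(X_demo, y_demo)
baseline = model_demo.score(X_demo, y_demo)   # baseline score before any shuffling

rng = np.random.RandomState(42)
importances = []
for col in range(X_demo.shape[1]):
    X_perm = X_demo.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])   # shuffle a single column
    importances.append(baseline - model_demo.score(X_perm, y_demo))  # drop in score

print(importances)  # one importance (drop in R^2) per feature
```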
As for the second problem with this method, I have already plotted the correlation matrix above. However, I will also use a function from one of the libraries below to visualise Spearman's correlations. The difference from standard Pearson's correlation is that Spearman's first transforms the variables into ranks and only then runs Pearson's correlation on the ranks.
Spearman's correlation:
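This rank-then-correlate relation is easy to verify with scipy (synthetic data, stand-in names):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Synthetic monotone but non-linear relation.
rng = np.random.RandomState(0)
a = rng.normal(size=100)
b = a ** 3 + rng.normal(scale=0.1, size=100)

# Spearman's rho equals Pearson's r computed on the ranks.
rho, _ = spearmanr(a, b)
r_on_ranks, _ = pearsonr(rankdata(a), rankdata(b))
print(np.isclose(rho, r_on_ranks))  # True
```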
!pip install rfpimp
Collecting rfpimp
  Downloading https://files.pythonhosted.org/packages/3f/ab/0fe16e849f21ab5462a227827cc1c67475609573e48428beec995251566b/rfpimp-1.3.7.tar.gz
Building wheels for collected packages: rfpimp
  Building wheel for rfpimp (setup.py) ... done
Successfully built rfpimp
Installing collected packages: rfpimp
Successfully installed rfpimp-1.3.7
time: 4.25 s (started: 2021-02-11 00:13:09 +00:00)
from rfpimp import plot_corr_heatmap
viz = plot_corr_heatmap(X_train, figsize=(15,10))
viz.view()
time: 10.6 s (started: 2021-02-11 00:13:15 +00:00)
I found two libraries with this functionality, not that it would be difficult to code yourself. Let's go over both of them, as each has some unique features.
There are a few differences between the basic approach of rfpimp and the one employed in eli5. Some of them are:

- cv and refit, connected to using cross-validation. In this example I set them to None, as I do not use cross-validation, but it might come in handy in some cases.
- a metric parameter, which, as in rfpimp, accepts a function in the form metric(model, X, y). If this parameter is not specified, the default score method of the estimator is used.
- n_iter, the number of random shuffle iterations; the end score is the average over iterations.

!pip install eli5
Collecting eli5
Downloading https://files.pythonhosted.org/packages/d1/54/04cab6e1c0ae535bec93f795d8403fdf6caf66fa5a6512263202dbb14ea6/eli5-0.11.0-py2.py3-none-any.whl (106kB)
|████████████████████████████████| 112kB 4.1MB/s
Requirement already satisfied: attrs>16.0.0 in /usr/local/lib/python3.6/dist-packages (from eli5) (20.3.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from eli5) (0.10.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from eli5) (1.15.0)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.6/dist-packages (from eli5) (0.24.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.6/dist-packages (from eli5) (2.11.3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from eli5) (1.19.5)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.6/dist-packages (from eli5) (0.8.7)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from eli5) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20->eli5) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20->eli5) (1.0.0)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2->eli5) (1.1.1)
Installing collected packages: eli5
Successfully installed eli5-0.11.0
time: 3.04 s (started: 2021-02-11 00:15:13 +00:00)
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(rf, cv = None, refit = False, n_iter = 50).fit(X_train, y_train)
perm_imp_eli5 = imp_df(X_train.columns, perm.feature_importances_)
var_imp_plot(perm_imp_eli5, 'Permutation feature importance (eli5)')
time: 19min 15s (started: 2021-02-11 00:15:18 +00:00)
The results are very similar to the previous ones, even though these come from multiple reshuffles per column.

The default importance output is not the most readable, as it does not contain the variable names. This is of course easily fixed. The nice extra is the standard error over all reshuffling iterations for each variable.
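One way to make this readable is to pair the fitted PermutationImportance attributes feature_importances_ and feature_importances_std_ with the column names (those attribute names are eli5's; the values below are stand-ins, not the notebook's results):

```python
import pandas as pd

# Stand-in values; in the notebook these would come from
# perm.feature_importances_ and perm.feature_importances_std_,
# paired with X_train.columns.
names = ['f0', 'f1', 'f2']
means = [0.34, 0.12, 0.24]
stds  = [0.003, 0.002, 0.005]

# Build a named, sorted importance table with the per-feature spread.
perm_df = (pd.DataFrame({'feature': names,
                         'importance': means,
                         'std': stds})
             .sort_values('importance', ascending=False)
             .reset_index(drop=True))
print(perm_df)
```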
eli5.show_weights(perm)
| Weight | Feature |
|---|---|
| 0.3367 ± 0.0034 | x24 |
| 0.3225 ± 0.0027 | x23 |
| 0.2427 ± 0.0051 | x0 |
| 0.1706 ± 0.0038 | x1 |
| 0.1706 ± 0.0018 | x27 |
| 0.1272 ± 0.0021 | x28 |
| 0.1162 ± 0.0011 | x26 |
| 0.1157 ± 0.0022 | x37 |
| 0.0922 ± 0.0018 | x18 |
| 0.0710 ± 0.0012 | x17 |
| 0.0672 ± 0.0009 | x34 |
| 0.0565 ± 0.0011 | x25 |
| 0.0540 ± 0.0010 | x19 |
| 0.0506 ± 0.0013 | x39 |
| 0.0501 ± 0.0011 | x22 |
| 0.0469 ± 0.0005 | x33 |
| 0.0459 ± 0.0004 | x36 |
| 0.0459 ± 0.0010 | x13 |
| 0.0449 ± 0.0009 | x10 |
| 0.0425 ± 0.0009 | x40 |
| … 21 more … | |
time: 36.1 ms (started: 2021-02-11 00:54:18 +00:00)
One extra nice thing about eli5 is that it is really easy to use the results of permutation approach to carry out feature selection by using Scikit-learn's SelectFromModel or RFE.
LIME (Local Interpretable Model-agnostic Explanations) is a technique for explaining the predictions of any classifier or regressor in an interpretable and faithful manner. An explanation is obtained by locally approximating the selected model with an interpretable one (such as a linear model with regularisation, or a decision tree). The interpretable model is trained on small perturbations (added noise) of the original observation (a row, in the case of tabular data), so it only provides a good local approximation.

Some drawbacks to be aware of:

- explanations are only locally faithful and can differ considerably between nearby observations
- the explanation depends on random perturbations, so repeated runs can give different results
- the choice of neighbourhood (kernel width) strongly influences the explanation
!pip install lime
Collecting lime
Downloading https://files.pythonhosted.org/packages/f5/86/91a13127d83d793ecb50eb75e716f76e6eda809b6803c5a4ff462339789e/lime-0.2.0.1.tar.gz (275kB)
|████████████████████████████████| 276kB 5.6MB/s
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from lime) (3.2.2)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from lime) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from lime) (1.4.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from lime) (4.41.1)
Requirement already satisfied: scikit-learn>=0.18 in /usr/local/lib/python3.6/dist-packages (from lime) (0.24.1)
Requirement already satisfied: scikit-image>=0.12 in /usr/local/lib/python3.6/dist-packages (from lime) (0.16.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->lime) (1.3.1)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->lime) (2.8.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->lime) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->lime) (0.10.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.18->lime) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.18->lime) (1.0.0)
Requirement already satisfied: networkx>=2.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image>=0.12->lime) (2.5)
Requirement already satisfied: pillow>=4.3.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image>=0.12->lime) (7.0.0)
Requirement already satisfied: imageio>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image>=0.12->lime) (2.4.1)
Requirement already satisfied: PyWavelets>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image>=0.12->lime) (1.1.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.1->matplotlib->lime) (1.15.0)
Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.6/dist-packages (from networkx>=2.0->scikit-image>=0.12->lime) (4.4.2)
Building wheels for collected packages: lime
Building wheel for lime (setup.py) ... done
Created wheel for lime: filename=lime-0.2.0.1-cp36-none-any.whl size=283846 sha256=62ddb33c8a71b17c76f30017db64d3813d78d332ce8122cf4f2958db5d34c97a
Stored in directory: /root/.cache/pip/wheels/4c/4f/a5/0bc765457bd41378bf3ce8d17d7495369d6e7ca3b712c60c89
Successfully built lime
Installing collected packages: lime
Successfully installed lime-0.2.0.1
time: 4.36 s (started: 2021-02-11 02:59:48 +00:00)
import lime
import lime.lime_tabular
# all features in this dataset are continuous, so no categorical_features are declared
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values,
                                                   mode = 'regression',
                                                   feature_names = X_train.columns,
                                                   discretize_continuous = True)
time: 1.04 s (started: 2021-02-11 03:00:05 +00:00)
Below you can see the output of LIME interpretation.
There are 3 parts of the output:

1. the predicted value for the explained observation, together with the local range of predictions
2. the features contributing for and against that prediction, with their weights
3. a table of the actual feature values for this observation
Note that LIME has discretized the features in the explanation. This is because of setting discretize_continuous=True in the constructor above. The reason for discretization is that it gives continuous features more intuitive explanations.
np.random.seed(42)
exp = explainer.explain_instance(X_train.values[31], rf.predict, num_features=5)
exp.show_in_notebook(show_all=False) #only the features used in the explanation are displayed
np.random.seed(42)
exp = explainer.explain_instance(X_train.values[85], rf.predict, num_features=5)
exp.show_in_notebook(show_all=False)
time: 5.02 s (started: 2021-02-11 03:00:09 +00:00)